Fix encoding of non-ascii field names in ignored source #131950
Closed
jordan-powers wants to merge 4 commits into elastic:main
Collaborator
Pinging @elastic/es-storage-engine (Team:StorageEngine)
Collaborator
Hi @jordan-powers, I've created a changelog YAML for you.
martijnvg (Member) approved these changes on Jul 28, 2025 and left a comment:
Can you confirm my thinking in the comment I left? Otherwise LGTM.
  byte[] nameBytes = values.name.getBytes(StandardCharsets.UTF_8);
  byte[] bytes = new byte[4 + nameBytes.length + values.value.length];
- ByteUtils.writeIntLE(values.name.length() + PARENT_OFFSET_IN_NAME_OFFSET * values.parentOffset, bytes, 0);
+ ByteUtils.writeIntLE(nameBytes.length + PARENT_OFFSET_IN_NAME_OFFSET * values.parentOffset, bytes, 0);
Member
Just double-checking: there is no need for an index version check here, given that decode isn't updated in this change. In other words, indexing new documents into indices with an older index version would no longer produce the error described in the PR description.
Contributor, Author
I realized that I can update the decode to handle the old format. This way we won't lose non-ascii data written in the old format. I've opened another PR: #132018
Contributor, Author
Closed in favor of the solution in #132018
When encoding an ignored source entry, we use String.length() to get the length of the encoded field name. String.length() counts UTF-16 code units, not bytes, so this only works when the field name contains only ascii characters, which UTF-8 encodes with 1 byte per character.
The solution is to use the actual length of the encoded field name byte[] array.
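To illustrate the mismatch, here is a minimal, self-contained sketch. The class NameLengthDemo and its helpers are hypothetical and are not part of the Elasticsearch code. The layout (a 4-byte little-endian length header, then the name bytes, then the value) is a simplification of the snippet under review and leaves out the parent offset.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; names here are illustrative, not Elasticsearch APIs.
public class NameLengthDemo {

    // Length as the old code computed it: number of UTF-16 code units.
    static int charLength(String name) {
        return name.length();
    }

    // Length as the fix computes it: number of bytes in the UTF-8 encoding.
    static int utf8Length(String name) {
        return name.getBytes(StandardCharsets.UTF_8).length;
    }

    // Simplified encoding (assumption): [int LE: name byte length][name bytes][value bytes].
    static byte[] encode(String name, byte[] value) {
        byte[] nameBytes = name.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(4 + nameBytes.length + value.length)
            .order(ByteOrder.LITTLE_ENDIAN);
        buf.putInt(nameBytes.length); // byte count, not name.length()
        buf.put(nameBytes);
        buf.put(value);
        return buf.array();
    }

    // Reads the length header and decodes exactly that many bytes as the name.
    static String decodeName(byte[] bytes) {
        int len = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN).getInt();
        return new String(bytes, 4, len, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Unicode escapes keep the source file ascii-only.
        String[] names = { "field", "f\u00efeld", "\u5b57\u6bb5", "\ud83d\ude00" };
        for (String name : names) {
            byte[] encoded = encode(name, new byte[] { 1, 2, 3 });
            System.out.println("chars=" + charLength(name)
                + " utf8Bytes=" + utf8Length(name)
                + " roundTrips=" + name.equals(decodeName(encoded)));
        }
    }
}
```

For the ascii name both lengths agree. For the other three names the UTF-8 byte count is larger than String.length(), so a header written with String.length() would make the decoder stop before the end of the name.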
I added a test that encodes a field in _ignored_source with a random unicode key. Without this fix, it fails with this stack trace: